📝 Walkthrough

Adds a new QTL Cascade feature: a `qtlcascade` package (cascade construction, ABF colocalization, pipeline orchestration, plotting, reporting), CLI subcommands to build/run/report cascades, new eQTL/pQTL table builders and registry entries, docs and tests; plus assorted formatting-only edits elsewhere.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User
  participant CLI as "hvantk CLI\n(qtlcascade)"
  participant Hail as "Hail Tables\n(eQTL/pQTL)"
  participant Cascade as "Cascade Builder"
  participant Coloc as "Colocalization"
  participant GeneSum as "Gene Summary"
  participant Plot as "Plotting"
  participant Report as "Report Generator"
  participant FS as "Filesystem (outputs)"
  User->>CLI: hvantk qtlcascade run
  CLI->>Hail: load(eQTL_ht), load(pQTL_ht)
  CLI->>Cascade: build_cascade(eqtl_ht, pqtl_ht, thresholds, tissue)
  Cascade->>Hail: outer-join on (locus, alleles, gene_id)
  Cascade->>Cascade: classify rows (eqtl_mediated/discordant/eqtl_only/pqtl_only)
  Cascade->>FS: write cascade.ht
  CLI->>GeneSum: build_cascade_gene_summary(cascade.ht, overlays)
  GeneSum->>FS: write gene_summary.ht / .tsv
  alt colocalization enabled
    CLI->>Coloc: run_coloc_per_gene(allpairs_eqtl, allpairs_pqtl, genes)
    Coloc->>FS: write coloc_results.tsv
  end
  CLI->>Plot: plot_* (class counts, attenuation, coloc posteriors)
  Plot->>FS: save PNGs
  CLI->>Report: generate_report(output_dir, inputs...)
  Report->>FS: write HTML report
  FS->>User: return report path / summary outputs
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Pull request overview
This PR adds a new QTL Cascade module to hvantk, enabling end-to-end molecular QTL integration workflows (eQTL + pQTL table building, cascade join/classification, optional colocalization, plots, and HTML reporting) exposed via both CLI and documented user workflows.
Changes:
- Added new `hvantk qtlcascade ...` CLI group plus a stage-based `CascadePipeline` (single- and multi-tissue execution) with plotting + HTML reporting.
- Added new `hvantk mktable eqtl` and `hvantk mktable pqtl` builders (GTEx v11/v8, eQTLGen, Fang pQTL) and registered them in the table registry.
- Added extensive documentation + examples for QTL Cascade usage and data sources.
Reviewed changes
Copilot reviewed 40 out of 42 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| hvantk/tables/table_builders.py | Adds eQTL/pQTL table builders + shared helpers for parsing GTEx-style variant IDs and scanning per-tissue inputs. |
| hvantk/tables/registry.py | Registers new eqtl / pqtl table builders in the table registry. |
| `hvantk/qtlcascade/__init__.py` | Exposes the public QTL Cascade API surface (pipeline, coloc, plotting, reporting). |
| hvantk/qtlcascade/constants.py | Introduces cascade class constants, default thresholds/priors, tissue mappings, and source identifiers. |
| hvantk/qtlcascade/cascade.py | Implements the core eQTL⊕pQTL outer-join + cascade classification. |
| hvantk/qtlcascade/coloc.py | Implements ABF coloc logic and a Hail→pandas driver for per-gene coloc. |
| hvantk/qtlcascade/gene_summary.py | Aggregates cascade HT results to gene-level summaries with optional overlays. |
| hvantk/qtlcascade/pipeline.py | Adds stage-based orchestration for cascade → summary → coloc → outputs, including multi-tissue collection mode. |
| hvantk/qtlcascade/plot.py | Adds matplotlib-based plots for cascade classes, coloc posteriors, cross-tissue heatmap, etc. |
| hvantk/qtlcascade/report.py | Adds a static self-contained HTML report generator. |
| hvantk/commands/qtlcascade_cli.py | Adds click CLI commands for cascade, coloc, run (pipeline), and report generation. |
| hvantk/commands/make_table_cli.py | Adds mktable eqtl and mktable pqtl subcommands wiring into the new builders. |
| hvantk/hvantk.py | Registers the new qtlcascade command group in the top-level CLI. |
| docs_site/tools/qtlcascade.md | New full documentation page for QTL Cascade (CLI + Python API + outputs). |
| docs_site/guide/data-sources.md | Adds QTL data sources guidance (GTEx/eQTLGen/Fang pQTL). |
| docs_site/guide/usage.md | Adds usage examples for building eQTL/pQTL tables via mktable. |
| docs_site/examples/qtlcascade.md | Adds an end-to-end QTL Cascade example workflow. |
| docs_site/examples/index.md | Links the new QTL Cascade example from the examples index. |
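The ABF colocalization in `hvantk/qtlcascade/coloc.py` (table above) presumably follows the standard Wakefield-ABF coloc method. As a reference point for reviewers, here is a minimal self-contained NumPy sketch of that method; the function names, the priors `p1`/`p2`/`p12`, and the effect-size prior `w=0.15` are conventional defaults assumed for illustration, not hvantk's actual code.

```python
import numpy as np

def log_abf(beta, se, w=0.15):
    """Wakefield log approximate Bayes factor per SNP (prior sd w on beta)."""
    v = se ** 2
    r = w ** 2 / (w ** 2 + v)
    z = beta / se
    return 0.5 * (np.log(1.0 - r) + r * z ** 2)

def _logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def _logdiff(a, b):
    """log(exp(a) - exp(b)) for a > b; -inf when the difference underflows."""
    if b >= a:
        return -np.inf
    return a + np.log1p(-np.exp(b - a))

def coloc_abf(beta1, se1, beta2, se2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of the five coloc hypotheses for one region."""
    l1 = log_abf(np.asarray(beta1, float), np.asarray(se1, float))
    l2 = log_abf(np.asarray(beta2, float), np.asarray(se2, float))
    s1, s2, s12 = _logsumexp(l1), _logsumexp(l2), _logsumexp(l1 + l2)
    lh = np.array([
        0.0,                                                # H0: no association
        np.log(p1) + s1,                                    # H1: trait 1 only
        np.log(p2) + s2,                                    # H2: trait 2 only
        np.log(p1) + np.log(p2) + _logdiff(s1 + s2, s12),   # H3: distinct variants
        np.log(p12) + s12,                                  # H4: shared variant
    ])
    post = np.exp(lh - _logsumexp(lh))
    return dict(zip(["H0", "H1", "H2", "H3", "H4"], post))
```

With a strong shared signal at the same SNP in both traits H4 dominates; with strong but disjoint signals H3 does.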
Comments suppressed due to low confidence (1)
hvantk/tables/table_builders.py:1866
- The reverse mapping table is keyed by `gene_name` and then indexed via `rev[ht.gene_symbol]`. If `gene_name` is not unique in the mapping HT, this can produce non-unique keys and make indexing ambiguous or error. Consider validating uniqueness of `gene_name` (or grouping/deduplicating deterministically) before using it as a key.
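A deterministic deduplication along these lines could look like the following plain-Python sketch (the helper name and the keep-smallest-id policy are assumptions for illustration; the real mapping is a Hail Table):

```python
# Hypothetical deterministic dedup for the symbol -> gene_id reverse map: when
# a gene_name maps to several gene_ids, keep the lexicographically smallest id
# and record the ambiguity instead of producing non-unique keys.
def build_symbol_to_id(pairs):
    """pairs: iterable of (gene_id, gene_name) tuples."""
    mapping, ambiguous = {}, set()
    for gene_id, gene_name in sorted(pairs):
        if gene_name in mapping and mapping[gene_name] != gene_id:
            ambiguous.add(gene_name)
        mapping.setdefault(gene_name, gene_id)
    return mapping, ambiguous
```

The `ambiguous` set lets the builder log (or drop) symbols that cannot be resolved unambiguously rather than failing on duplicate keys.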
```python
# Annotate counts
for i, v in enumerate(counts):
    ax.text(v + max(counts) * 0.01, i, f"{v:,}", va="center", fontsize=10)
```
plot_cascade_classes() will raise when class_counts is empty (or contains none of the known CASCADE_CLASSES) because max(counts) is called during annotation. Consider returning an empty/"No data" plot (similar to plot_attenuation) when classes/counts is empty before calling max().
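A minimal sketch of the suggested guard (not hvantk's `plot.py`; the function name and signature are hypothetical): bail out with a "No data" panel before `max()` can be called on an empty list.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def plot_class_counts(class_counts, output_path):
    fig, ax = plt.subplots(figsize=(6, 3))
    if not class_counts:
        # Empty input: render a placeholder instead of calling max([]) below.
        ax.text(0.5, 0.5, "No data", ha="center", va="center",
                transform=ax.transAxes, fontsize=12)
        ax.set_axis_off()
        fig.savefig(output_path)
        plt.close(fig)
        return
    classes = list(class_counts)
    counts = [class_counts[c] for c in classes]
    ax.barh(classes, counts)
    for i, v in enumerate(counts):  # safe: counts is non-empty here
        ax.text(v + max(counts) * 0.01, i, f"{v:,}", va="center", fontsize=10)
    fig.tight_layout()
    fig.savefig(output_path)
    plt.close(fig)
```

Saving the placeholder figure (rather than returning early with nothing on disk) also keeps downstream report assembly from missing an expected PNG.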
```python
for ext in extensions:
    matches.extend(sorted(p.glob(f"*{ext}")))
if not matches:
    raise FileNotFoundError(f"No files matching {extensions} in {input_path}")
return [(str(f), f.stem.split(".")[0]) for f in matches]
```
_parse_gtex_variant_id() uses the contig string from the GTEx-style variant_id (e.g. "chr1") directly in hl.locus(..., reference_genome=...). For GRCh37, Hail contigs are typically "1" (no "chr" prefix), so this can fail when --ref-genome GRCh37 is used. Consider normalizing contigs based on reference_genome (strip/add "chr" as needed) before constructing the locus.
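A plain-Python illustration of the normalization rule being asked for (the helper names are hypothetical; the real builder works on Hail expressions, so this only demonstrates the contig logic itself):

```python
def normalize_contig(contig, reference_genome):
    """GTEx variant_ids use 'chr1'-style contigs; Hail's GRCh37 expects '1'."""
    if reference_genome.startswith("GRCh37"):
        return contig[3:] if contig.startswith("chr") else contig
    return contig if contig.startswith("chr") else f"chr{contig}"

def parse_gtex_variant_id(variant_id, reference_genome="GRCh38"):
    """Split 'chr1_12345_A_G_b38' into (contig, position, ref, alt)."""
    contig, pos, ref, alt, _build = variant_id.split("_")
    return normalize_contig(contig, reference_genome), int(pos), ref, alt
```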
```python
result = self._stage_build_cascade(result, tissue, tissue_dir)

# Stage 2: Gene summary
result = self._stage_gene_summary(result, tissue_dir)
```
run() builds the gene summary before running coloc. Since build_cascade_gene_summary() takes coloc_df to overlay coloc_max_h4/coloc_n_tissues, gene_summary.ht will never include coloc annotations with the current stage order. Consider running coloc before gene summary, or regenerating/patching the gene summary after coloc completes.
```python
@click.group(
    name="qtlcascade",
    help="Molecular QTL cascade analysis (eQTL → pQTL → disease).",
    context_settings=CONTEXT_SETTINGS,
)
```
This new qtlcascade CLI module isn't covered by the existing CLI test suite yet (there are many click command tests under hvantk/tests/). Consider adding basic invocation tests (patching CascadePipeline/build_cascade/run_coloc_per_gene) to ensure option wiring and exit codes are stable.
```python
EQTL_SOURCES = ("gtex_v11", "gtex_v8", "eqtlgen")
PQTL_SOURCES = ("gtex_fang", "ukb_ppp", "decode", "sun_2018")
```
PQTL_SOURCES lists several identifiers (e.g. "ukb_ppp", "decode", "sun_2018") but create_pqtl_tb currently raises NotImplementedError for anything except "gtex_fang". This is likely to confuse downstream callers (and can break CLI choice generation if it ever uses PQTL_SOURCES). Consider restricting PQTL_SOURCES to implemented sources for now, or splitting into SUPPORTED vs IMPLEMENTED constants.
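A hypothetical restructuring of `constants.py` along the lines suggested, with a fail-fast validator so planned sources produce a distinct error (names and split are assumptions, not hvantk code):

```python
# Only the implemented backend is advertised to CLI choices; planned ones are
# kept separate so the roadmap stays visible without misleading callers.
PQTL_SOURCES_IMPLEMENTED = ("gtex_fang",)
PQTL_SOURCES_PLANNED = ("ukb_ppp", "decode", "sun_2018")

def validate_pqtl_source(source):
    """Fail fast, distinguishing planned-but-unimplemented from unknown."""
    if source in PQTL_SOURCES_PLANNED:
        raise NotImplementedError(
            f"pQTL source {source!r} is planned but not yet implemented"
        )
    if source not in PQTL_SOURCES_IMPLEMENTED:
        raise ValueError(f"unknown pQTL source {source!r}")
    return source
```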
```python
ht = hl.read_table(result.cascade_ht_path)
cascade_genes = list(
    ht.filter(ht.cascade_class == "eqtl_mediated").aggregate(
        hl.agg.collect_as_set(ht.gene_id)
```
_stage_coloc() says it is getting "genes with both eQTL and pQTL", but it filters to cascade_class == "eqtl_mediated" only. Genes with both signals but discordant direction (cascade_class == "discordant") will be excluded from coloc. Consider including discordant genes too if the intent is "both present" rather than "concordant only".
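The intended selection ("both signals present") covers both the concordant and the discordant classes. A plain-Python sketch of the membership logic (the real code filters a Hail Table; names here are illustrative):

```python
# "Both eQTL and pQTL present" means eqtl_mediated OR discordant,
# not concordant-only.
BOTH_PRESENT_CLASSES = {"eqtl_mediated", "discordant"}

def genes_with_both_signals(rows):
    """rows: iterable of dicts with 'gene_id' and 'cascade_class' fields."""
    return {
        r["gene_id"] for r in rows if r["cascade_class"] in BOTH_PRESENT_CLASSES
    }
```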
```python
rows = []
for tissue, res in results.items():
    if res.class_counts:
        for cls, cnt in res.class_counts.items():
            rows.append(
```
_generate_cross_tissue_outputs() populates the heatmap input by setting gene_id to the cascade class label and n_concordant to the class count. This doesn't match plot_cross_tissue_heatmap()'s documented input (genes × tissues) and will produce a misleading plot. Consider either plotting class counts explicitly (rename fields/value_col) or building a true gene×tissue matrix from gene summaries.
```python
@click.pass_context
def run_cmd(
    ctx,
    eqtl_ht,
    pqtl_ht,
```
The multi-tissue summary in run_cmd hardcodes the coloc threshold as 0.8. Since DEFAULT_COLOC_H4_THRESHOLD is already defined in hvantk.qtlcascade.constants and used elsewhere, consider referencing the constant here to keep behavior consistent if the default changes.
```python
input_path=input_path,
output_path=output_path,
import_func=import_func,
```
The docstring says tissue can "override inferred tissue name", but the import helpers always set tissue=tname from the filename and use the tissue parameter only as a filter. Either update the docs to reflect filter-only behavior or implement an override (e.g., when tissue is provided for a single file, set the output tissue field to that value).
```python
logger.info("Overlaying disease-gene labels from %s", disease_genes_ht_path)
disease_ht = hl.read_table(disease_genes_ht_path)
gene_ht = gene_ht.annotate(
    is_disease_gene=hl.is_defined(disease_ht.index(gene_ht.gene_id)),
)
```
The disease-gene overlay indexes disease_ht by gene_ht.gene_id, which assumes the provided disease table is keyed by Ensembl gene_id. hvantk's ClinGen gene-disease table builder keys by hgnc_id by default, so passing a ClinGen HT (as suggested in docs/CLI) will produce no matches. Consider validating the expected key or supporting hgnc_id via an HGNC/Ensembl mapping join.
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hvantk/tests/test_ptm.py (1)
224-233: ⚠️ Potential issue | 🟡 Minor

Use `tmp_path` fixture instead of hardcoded `/tmp/test` paths.

Line 232 uses hardcoded `/tmp/test` paths for test validation. This is brittle (permissions, cleanup, Windows compatibility) and inconsistent with the `tmp_path` fixture already used in `test_map_ptm_sites_roundtrip` (line 133).

🧪 Proposed fix
```diff
-def test_ptm_build_config_validation():
+def test_ptm_build_config_validation(tmp_path):
     """PTMBuildConfig.validate catches missing required fields."""
     config = PTMBuildConfig()
     errors = config.validate()
     assert len(errors) == 2
     assert any("output_dir" in e for e in errors)
     assert any("output_ht" in e for e in errors)
-    config2 = PTMBuildConfig(output_dir="/tmp/test", output_ht="/tmp/test/out.ht")
+    config2 = PTMBuildConfig(
+        output_dir=str(tmp_path),
+        output_ht=str(tmp_path / "out.ht")
+    )
     assert config2.validate() == []
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/tests/test_ptm.py` around lines 224-233, update `test_ptm_build_config_validation` to use the pytest `tmp_path` fixture instead of hardcoded "/tmp/test" strings: add `tmp_path` as a test parameter, set `output_dir` to `str(tmp_path)` (or `tmp_path.as_posix()`) and `output_ht` to `str(tmp_path / "out.ht")` when constructing `PTMBuildConfig`, so the test uses an ephemeral, OS-independent directory and file path.
🧹 Nitpick comments (6)
hvantk/datasets/uniprot_ptm_datasets.py (1)
323-323: Use explicit f-string conversion for exception text.

Ruff's `RUF010` applies here: prefer `{e!s}` instead of `str(e)` inside the f-string.

Suggested diff
```diff
-        raise RuntimeError(f"Failed to download UniProt PTM data: {str(e)}") from e
+        raise RuntimeError(f"Failed to download UniProt PTM data: {e!s}") from e
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/datasets/uniprot_ptm_datasets.py` at line 323, replace the explicit `str(e)` conversion inside the f-string with the conversion flag `{e!s}` on the `raise RuntimeError` line so the exception text uses the explicit f-string conversion and satisfies RUF010; keep the `from e` chaining intact.

hvantk/data/gencc_streamer.py (1)
274-276: Consider using a single f-string for cleaner formatting.

The current approach using two adjacent f-strings works correctly but is somewhat unconventional. A single f-string would be more idiomatic and readable:

```diff
 logger.info(
-    f"Found {len(genes)} consensus genes " f"(min_submitters={min_submitters})"
+    f"Found {len(genes)} consensus genes (min_submitters={min_submitters})"
 )
```

Also noting that this formatting change appears unrelated to the PR's stated objectives (QTL Cascade and eQTL/pQTL features).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/data/gencc_streamer.py` around lines 274-276, the `logger.info` call currently uses two adjacent f-strings; replace them with a single f-string in the `logger.info(...)` invocation so the message reads as one formatted string referencing `len(genes)` and `min_submitters`, keeping the call and variables unchanged for readability.

hvantk/ptm/plot.py (1)
120-140: Consider adding `strict=True` to `zip()` calls for safer iteration.

Lines 120 and 130 use `zip()` without an explicit `strict=` parameter. While the arrays are correctly paired in this context, adding `strict=True` (Python 3.10+) would catch length mismatches earlier if the code changes in the future.

♻️ Optional enhancement
```diff
-    for bar, n in zip(bars_p, path_counts):
+    for bar, n in zip(bars_p, path_counts, strict=True):
         if n > 0:
             ax1.text(
                 bar.get_x() + bar.get_width() / 2,
                 bar.get_height(),
                 str(n),
                 ha="center",
                 va="bottom",
                 fontsize=8,
             )
-    for bar, n in zip(bars_b, benign_counts):
+    for bar, n in zip(bars_b, benign_counts, strict=True):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/ptm/plot.py` around lines 120-140, the `zip()` calls iterating over `bars_p`/`path_counts` and `bars_b`/`benign_counts` should be made strict to catch future length mismatches; update the two `zip(...)` calls to use `zip(..., strict=True)` when iterating in the loops that call `ax1.text`, so a ValueError is raised if lengths differ.

hvantk/commands/ptm_cli.py (1)
191-191: Remove redundant `hail` import in `ptm_annotate`.

`hail` is already imported at Line 180 in the same `try` block, so Line 191 is a duplicate and can be dropped for clarity.

♻️ Minimal cleanup
```diff
-    import hail as hl
-
     counts = result.aggregate(
         hl.struct(
             n_total=hl.agg.count(),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/commands/ptm_cli.py` at line 191, the duplicate import of `hail` inside the `ptm_annotate` try block should be removed: there is already an `import hail` earlier in that same try (around Line 180), so delete the redundant statement at the later occurrence to avoid duplicate imports and improve clarity.

hvantk/qtlcascade/gene_summary.py (1)
75-76: Defensive code for `gene_symbol` that currently won't trigger.

Per `cascade.py` lines 91-124, `build_cascade` does not create a `gene_symbol` field in its output. This conditional will never evaluate to true with the current cascade builder implementation.

If this is intentional future-proofing for potential schema changes or external cascade tables, consider adding a brief comment to clarify the intent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/gene_summary.py` around lines 75-76, the conditional checking `"gene_symbol" in row_fields` (the `agg_exprs["gene_symbol"] = hl.agg.take(ht.gene_symbol, 1)[0]` branch) is dead with the current cascade builder (see `build_cascade` in cascade.py not emitting `gene_symbol`); either remove the unreachable conditional and associated agg_expr, or keep it with a brief clarifying comment referencing `build_cascade` so reviewers understand it is deliberate defensive/future-proofing for external cascades or schema changes.

hvantk/qtlcascade/__init__.py (1)
76-113: Consider sorting `__all__` for lint stability.

Ruff reports `RUF022`; sorting avoids unnecessary lint noise/churn in future edits.

Suggested approach
```diff
-__all__ = [
-    ...
-]
+__all__ = sorted([
+    ...
+])
```

Or keep a manually sorted static list if you prefer explicit ordering.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/__init__.py` around lines 76 - 113, The __all__ export list is unsorted which triggers RUF022; sort the entries in the __all__ list (e.g., "build_cascade", "build_cascade_gene_summary", "CASCADE_CLASS_COLORS", "CASCADE_CLASSES", "CascadeConfig", "CascadePipeline", "CascadeResult", "DEFAULT_COLOC_H4_THRESHOLD", etc.) into a stable lexicographic order (or replace with a manually sorted static ordering) so the list is deterministic and lint-stable; update the existing __all__ definition accordingly while preserving all original names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs_site/examples/index.md`:
- Line 123: Update the "Outputs:" line so it clearly marks coloc results as
conditional on providing allpairs inputs: change the Outputs list (the line
beginning "**Outputs:**") to indicate that "coloc results (TSV)" are only
produced if the allpairs inputs are passed (for example add "(if allpairs inputs
provided)" or similar wording next to "coloc results (TSV)") so readers don’t
assume coloc is always generated.
In `@docs_site/tools/qtlcascade.md`:
- Line 108: Several fenced code blocks in qtlcascade.md are missing language
identifiers (causing markdownlint MD040); update each untagged triple-backtick
block (the occurrences around the visible code fences) to include an appropriate
language tag (e.g., ```bash for shell/command blocks, ```text for
pseudo-formula/text blocks, ```python for Python snippets) so every ``` fence in
the file is changed to a tagged form (e.g., ```text, ```bash, or ```python) to
satisfy linting and ensure consistent rendering.
- Line 247: The table cell for `attenuation_ratio` contains raw pipe characters
which break Markdown table columns; update the description in the
`attenuation_ratio` cell (currently "1 - |pQTL_beta|/|eQTL_beta| (eqtl_mediated
only)") by escaping the inner pipe characters (e.g., replace
`|pQTL_beta|/|eQTL_beta|` with `\|pQTL_beta\|/\|eQTL_beta\|`) so the table
renders correctly.
In `@hvantk/commands/qtlcascade_cli.py`:
- Around line 248-250: The dry-run branch calls pipeline.show_plan() but
discards its return value, so nothing is printed; capture the returned formatted
plan from pipeline.show_plan() and output it (e.g., via click.echo or printing)
before returning so `hvantk qtlcascade run --dry-run` actually displays the
plan; update the block that checks dry_run to assign plan = pipeline.show_plan()
and echo/log that plan (use the CLI's existing echo/logger mechanism).
In `@hvantk/ptm/plot.py`:
- Around line 146-161: The displayed CI string uses a Unicode EN DASH between
result.enrichment_ci_low and ci_hi_str inside the ax1.text call; replace that
character with the ASCII hyphen-minus '-' in the f-string (the ax1.text block
that references ci_hi_str, result.enrichment_ci_low, and
result.enrichment_odds_ratio) so the confidence interval is rendered as
f"({result.enrichment_ci_low:.2f}-{ci_hi_str})" to avoid encoding issues.
In `@hvantk/qtlcascade/cascade.py`:
- Around line 77-104: The join currently only keys on (locus, alleles, gene_id)
so when no tissue filter is passed rows from different tissues collapse; fix by
including tissue in the join key whenever the tables contain a tissue column and
the caller did not supply a single tissue: detect if "tissue" in
list(eqtl_ht.row) or "tissue" in list(pqtl_ht.row) and if tissue is None, call
key_by/key on both eqtl_ht and pqtl_ht to include tissue (e.g., key_by(locus,
alleles, gene_id, tissue)) before performing ht = eqtl_ht.join(pqtl_ht,
how="outer"); alternatively raise a clear error requiring the tissue argument if
you prefer to force single-tissue inputs.
In `@hvantk/qtlcascade/coloc.py`:
- Around line 208-227: The join currently performed by eqtl_sel.join(pqtl_sel,
how="inner") can mix rows from different tissues when tissue is not provided;
update the logic so that if the tables contain a "tissue" field (check "tissue"
in list(eqtl_ht.row) and list(pqtl_ht.row)) you add tissue to the selected
fields (e.g., include tissue in eqtl_sel and pqtl_sel) and include tissue in the
join key, otherwise raise or require the tissue argument; modify the code around
eqtl_sel, pqtl_sel and the joined assignment to either require tissue or join on
(locus/position, alleles, gene_id, tissue) so rows from different tissues are
never merged.
In `@hvantk/qtlcascade/constants.py`:
- Around line 77-78: PQTL_SOURCES currently advertises multiple backends but
create_pqtl_tb raises NotImplementedError for all but gtex_fang; update the
constant to only list implemented backends (keep PQTL_SOURCES = ("gtex_fang",))
or split into two constants (e.g., PQTL_SOURCES_IMPLEMENTED and
PQTL_SOURCES_PLANNED) and update any CLI/help references to use the implemented
list; locate the PQTL_SOURCES definition in constants.py and the create_pqtl_tb
function in hvantk/tables/table_builders.py to ensure they are consistent.
In `@hvantk/qtlcascade/gene_summary.py`:
- Around line 91-97: After reading constraint_ht in gene_summary.py, validate
its key fields and types before calling constraint_ht.index(gene_ht.gene_id):
check that constraint_ht.key_fields (or constraint_ht.key) contains "gene_id"
and that the key field type matches the type of gene_ht.gene_id; if not, log an
error (including actual key fields and types) and raise or return early to avoid
silent missing lookups or TableIndexKeyError. Also wrap the .index(...) call in
a try/except for TableIndexKeyError to produce a clear error message referencing
constraint_ht and gene_ht rather than letting the raw exception propagate.
Ensure you update the constraint_idx/oe_lof_upper annotation only after
successful validation.
In `@hvantk/qtlcascade/pipeline.py`:
- Around line 209-217: The gene-summary is built before coloc runs so
result.coloc_df remains None; change the stage ordering to run
_stage_coloc(result, tissue) before _stage_gene_summary(result, tissue_dir) (or
alternatively call _stage_gene_summary again after _stage_coloc to merge coloc
into the summary); apply the same fix for the later block that mirrors lines
316-323 so any path that enables coloc populates result.coloc_df before
_stage_gene_summary is invoked.
- Around line 219-220: The single-tissue run path never generates the HTML
report because generate_report() is only called from run_collection(); ensure
report generation is invoked for the single-sample/run() path by calling
generate_report(...) at the end of the normal run flow (or by updating
_stage_outputs(result, safe_name) to call generate_report when appropriate).
Specifically, add a call to generate_report within run() after _stage_outputs
(or inside _stage_outputs when safe_name/result indicate single-tissue mode) so
that functions generate_report and _stage_outputs both produce the HTML report
for non-collection runs just like run_collection() does.
- Around line 339-345: The code currently filters the cascade table at
ht.filter(ht.cascade_class == "eqtl_mediated") when building cascade_genes,
which excludes genes labeled "discordant" that also have both eQTL and pQTL
signals; update the filter to include both classes (e.g., ht.cascade_class ==
"eqtl_mediated" or ht.cascade_class == "discordant", or use
hl.set(['eqtl_mediated','discordant']).contains(ht.cascade_class)) so
cascade_genes (derived from result.cascade_ht_path / ht.aggregate /
hl.agg.collect_as_set) includes discordant genes too.
In `@hvantk/qtlcascade/plot.py`:
- Around line 126-138: The empty-data branch currently returns a placeholder
figure without saving it, so downstream consumers miss the expected PNG; modify
the branch where both.empty is True to call the module's saving helper (e.g.,
_save_figure or whatever save routine is used elsewhere) with the created fig
and output_path before returning; ensure you pass the same figsize/format
parameters used elsewhere and keep the logger.warning and "No data" text, then
return the saved fig so the output_path file exists for report assembly.
In `@hvantk/qtlcascade/report.py`:
- Around line 89-96: Escape all user/data-derived strings before inserting into
HTML: import html and wrap values like tissues items, title, description, plot
names, TSV cell values, and any variables used in rows.append (e.g., tissues,
cls, cnt, pct, and other interpolated variables in the blocks at lines ~135-145,
181-185, 213-242) with html.escape(...). Update the code paths that build HTML
rows (the rows.append calls shown and similar ones elsewhere) to call
html.escape on each non-markup value and format numbers separately (don’t escape
numeric formatting), ensuring you escape join elements (e.g., ',
'.join(html.escape(t) for t in tissues)) and any string inserted into table
cells.
In `@hvantk/tables/table_builders.py`:
- Around line 1820-1839: The pQTL builder currently falls back to raw gene
symbols when gene_map_ht is omitted, which will prevent cascade joins because
create_eqtl_tb keys by Ensembl gene_id; change the logic in the block that uses
gene_map_ht so that if gene_map_ht is None you either raise a clear exception
(e.g., ValueError) requiring a gene_map_ht for cascade-compatible pQTL tables or
require an explicit opt-out flag (add a parameter like
allow_raw_gene_symbol=False) and only permit the fallback when that flag is set
true; update the code paths around gene_map_ht, ht.annotate(gene_id=...) and the
final ht.key_by("locus","alleles","gene_id") to enforce this check and surface a
helpful error message mentioning create_eqtl_tb and the cascade key tuple.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6c03a748-d238-420c-8258-64120eaa0af7
⛔ Files ignored due to path filters (2)
- `docs_site/images/hvantk-architecture.svg` is excluded by `!**/*.svg`
- `docs_site/images/hvantk-qtlcascade-workflow.svg` is excluded by `!**/*.svg`
📒 Files selected for processing (40)
- `docs_site/examples/index.md`
- `docs_site/examples/qtlcascade.md`
- `docs_site/guide/data-sources.md`
- `docs_site/guide/usage.md`
- `docs_site/tools/qtlcascade.md`
- `hvantk/annotation/annotate.py`
- `hvantk/commands/gencc_downloader.py`
- `hvantk/commands/genesets_cli.py`
- `hvantk/commands/hgc/__init__.py`
- `hvantk/commands/make_table_cli.py`
- `hvantk/commands/ptm_cli.py`
- `hvantk/commands/qtlcascade_cli.py`
- `hvantk/commands/uniprot_ptm_downloader.py`
- `hvantk/commands/validate_bgzf_cli.py`
- `hvantk/data/cosmic_cgc_streamer.py`
- `hvantk/data/file_utils.py`
- `hvantk/data/gencc_streamer.py`
- `hvantk/data/gene_disease_streamer.py`
- `hvantk/datasets/uniprot_ptm_datasets.py`
- `hvantk/hvantk.py`
- `hvantk/psroc/pipeline.py`
- `hvantk/ptm/analysis.py`
- `hvantk/ptm/annotate.py`
- `hvantk/ptm/constants.py`
- `hvantk/ptm/mapper.py`
- `hvantk/ptm/pipeline.py`
- `hvantk/ptm/plot.py`
- `hvantk/ptm/report.py`
- `hvantk/qtlcascade/__init__.py`
- `hvantk/qtlcascade/cascade.py`
- `hvantk/qtlcascade/coloc.py`
- `hvantk/qtlcascade/constants.py`
- `hvantk/qtlcascade/gene_summary.py`
- `hvantk/qtlcascade/pipeline.py`
- `hvantk/qtlcascade/plot.py`
- `hvantk/qtlcascade/report.py`
- `hvantk/tables/registry.py`
- `hvantk/tables/table_builders.py`
- `hvantk/tests/test_ptm.py`
- `hvantk/utils/table_utils.py`
```python
def _import_eqtl_eqtlgen(input_path, reference_genome):
    """Import eQTLGen cis-eQTL summary statistics."""
    logger.info("Importing eQTLGen: %s", input_path)
    ht = hl.import_table(
        input_path,
        force=True,
        types={"Pvalue": hl.tfloat64, "Zscore": hl.tfloat64},
    )
    # Construct a GTEx-format variant_id so the shared parser can handle it.
    if reference_genome.startswith("GRCh38"):
        contig = hl.format("chr%s", ht.SNPChr)
    else:
        contig = ht.SNPChr
    ht = ht.select(
        gene_id_raw=ht.Gene,
        variant_id=hl.delimit(
            [contig, ht.SNPPos, ht.OtherAllele, ht.AssessedAllele, "b37"],
            "_",
        ),
        beta=hl.missing(hl.tfloat64),  # eQTLGen provides Z-score, not beta
        se=hl.missing(hl.tfloat64),
        p_value=ht.Pvalue,
        maf=hl.missing(hl.tfloat64),
        tissue="Blood",
        gene_symbol=ht.GeneSymbol,
    )
```
eqtlgen is exposed as supported, but its tables are unusable downstream.
This path writes `beta` and `se` as missing for every row. The cascade builder reads those fields to classify overlap direction, and the coloc driver requires `se`, so an eqtlgen table cannot participate correctly in either workflow. Either derive usable effect statistics here, or fail fast when `source="eqtlgen"` is requested for cascade/coloc inputs.
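If the importer is to derive usable effect statistics rather than fail fast, the usual route is to reconstruct beta and SE from the Z-score, the effect-allele frequency, and the sample size. A minimal sketch under that assumption (the helper name `z_to_beta_se` is illustrative, not existing code; the frequency and N would have to come from eQTLGen's own columns):

```python
import numpy as np

def z_to_beta_se(z: float, eaf: float, n: int) -> tuple[float, float]:
    """Recover (beta, se) from a Z-score, effect-allele frequency, and N.

    Uses beta = z / sqrt(2p(1-p)(n + z^2)) and se = beta / z, so the
    reconstructed beta/se ratio reproduces the original Z-score exactly.
    """
    denom = np.sqrt(2.0 * eaf * (1.0 - eaf) * (n + z**2))
    se = 1.0 / denom
    return z * se, se

beta, se = z_to_beta_se(z=4.2, eaf=0.3, n=30000)
print(round(beta / se, 6))  # recovers the input Z-score: 4.2
```

The effect sizes this produces are on a standardized scale, which is fine for sign/attenuation comparisons in the cascade but should be documented as such.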
```python
if gene_map_ht:
    logger.info("Mapping gene symbols → Ensembl IDs via %s", gene_map_ht)
    gm = hl.read_table(gene_map_ht)
    rev = (
        gm.key_by()
        .select("gene_id", "gene_name")
        .key_by("gene_name")
        .distinct()
    )
    ht = ht.annotate(
        gene_id=hl.or_else(rev[ht.gene_symbol].gene_id, ht.gene_symbol),
    )
else:
    ht = ht.annotate(gene_id=ht.gene_symbol)

if p_threshold is not None and p_threshold > 0:
    ht = ht.filter(ht.p_value <= p_threshold)

ht = ht.annotate(source=source, is_cis=True)
ht = ht.key_by("locus", "alleles", "gene_id")
```
Make Ensembl gene mapping mandatory for cascade-compatible pQTL tables.
When gene_map_ht is omitted, Line 1833 keys the pQTL table by raw gene symbol. create_eqtl_tb() keys eQTL rows by Ensembl gene_id, and the cascade join uses (locus, alleles, gene_id), so the command can successfully build a pQTL HT that is almost guaranteed not to join with any eQTL HT. This should hard-fail or at least require an explicit opt-out for non-cascade use.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/tables/table_builders.py` around lines 1820 - 1839, The pQTL builder
currently falls back to raw gene symbols when gene_map_ht is omitted, which will
prevent cascade joins because create_eqtl_tb keys by Ensembl gene_id; change the
logic in the block that uses gene_map_ht so that if gene_map_ht is None you
either raise a clear exception (e.g., ValueError) requiring a gene_map_ht for
cascade-compatible pQTL tables or require an explicit opt-out flag (add a
parameter like allow_raw_gene_symbol=False) and only permit the fallback when
that flag is set true; update the code paths around gene_map_ht,
ht.annotate(gene_id=...) and the final ht.key_by("locus","alleles","gene_id") to
enforce this check and surface a helpful error message mentioning create_eqtl_tb
and the cascade key tuple.
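A minimal sketch of the guard described in the prompt, with the suggested opt-out flag (`require_gene_map` and `allow_raw_gene_symbol` are hypothetical names, not existing code):

```python
def require_gene_map(gene_map_ht, allow_raw_gene_symbol: bool = False) -> None:
    """Fail fast when no gene_id mapping is available for a cascade table.

    Without a mapping, rows stay keyed by raw gene symbol and will not join
    with eQTL tables that create_eqtl_tb keys by Ensembl gene_id on
    (locus, alleles, gene_id).
    """
    if gene_map_ht is None and not allow_raw_gene_symbol:
        raise ValueError(
            "gene_map_ht is required for cascade-compatible pQTL tables: "
            "create_eqtl_tb keys rows by Ensembl gene_id and the cascade "
            "join uses (locus, alleles, gene_id). Pass "
            "allow_raw_gene_symbol=True to opt out for non-cascade use."
        )

require_gene_map("gs://bucket/gene_map.ht")          # mapping supplied: OK
require_gene_map(None, allow_raw_gene_symbol=True)   # explicit opt-out: OK
```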
…pdate documentation for tissue restriction
…and functionality
Actionable comments posted: 6
♻️ Duplicate comments (1)
hvantk/qtlcascade/pipeline.py (1)
76-91: ⚠️ Potential issue | 🟠 Major: Reject invalid thresholds and coloc priors in `validate()`.

`validate()` currently accepts values like `eqtl_p_threshold=-1`, `pqtl_p_threshold=2`, or non-positive coloc priors/window sizes. Those inputs silently disable filters or feed nonsensical parameters into coloc instead of failing fast.

Suggested guardrails
```diff
 def validate(self) -> List[str]:
     """Return a list of validation errors (empty if valid)."""
     errors: List[str] = []
     if not self.eqtl_ht:
         errors.append("eqtl_ht is required.")
     if not self.pqtl_ht:
         errors.append("pqtl_ht is required.")
     if not self.output_dir:
         errors.append("output_dir is required.")
     # Coloc needs both allpairs
     if bool(self.eqtl_allpairs_ht) != bool(self.pqtl_allpairs_ht):
         errors.append(
             "Both eqtl_allpairs_ht and pqtl_allpairs_ht are required "
             "for colocalization (or omit both to skip coloc)."
         )
+    if not 0 <= self.eqtl_p_threshold <= 1:
+        errors.append("eqtl_p_threshold must be between 0 and 1.")
+    if not 0 <= self.pqtl_p_threshold <= 1:
+        errors.append("pqtl_p_threshold must be between 0 and 1.")
+    if self.coloc_window_kb <= 0:
+        errors.append("coloc_window_kb must be > 0.")
+    for name, value in (
+        ("coloc_p1", self.coloc_p1),
+        ("coloc_p2", self.coloc_p2),
+        ("coloc_p12", self.coloc_p12),
+    ):
+        if not 0 < value < 1:
+            errors.append(f"{name} must be between 0 and 1.")
+    if self.coloc_W <= 0:
+        errors.append("coloc_W must be > 0.")
     return errors
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/pipeline.py` around lines 76 - 91, The validate() method must reject out-of-range thresholds and coloc parameters: add checks in validate() to ensure eqtl_p_threshold and pqtl_p_threshold are between 0 and 1 (exclusive of negatives or >1) and, if present, that coloc priors (e.g., coloc_prior / coloc_prior_secondary) are >0 and <=1 and coloc window/size parameters are positive integers; append descriptive error strings to errors when any of these constraints fail, keeping the existing eqtl_allpairs_ht/pqtl_allpairs_ht consistency check (use the validate() function and attribute names eqtl_p_threshold, pqtl_p_threshold, coloc_prior, coloc_window to locate where to add the new guards).
🧹 Nitpick comments (5)
hvantk/qtlcascade/plot.py (1)
338-340: Consider adding `strict=True` to `zip()` for defensive coding.

The `zip()` at line 338 combines `bp["boxes"]` and `colors`, which are derived from the same `classes` list, so their lengths should always match. Adding `strict=True` would catch any future refactoring bugs where the lengths could diverge.

Example fix
```diff
-    for patch, color in zip(bp["boxes"], colors):
+    for patch, color in zip(bp["boxes"], colors, strict=True):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/plot.py` around lines 338 - 340, The loop using zip(bp["boxes"], colors) should use zip(..., strict=True) to defensively ensure bp["boxes"] and colors have identical lengths; update the iteration (where bp["boxes"] and colors are created from classes) to call zip(bp["boxes"], colors, strict=True) so any future mismatch between bp["boxes"] and colors raises an error immediately.

hvantk/tests/test_cli_qtlcascade.py (1)
76-116: Consider using the `tmp_path` fixture instead of hardcoded `/tmp` paths.

The tests at lines 87 and 112 use hardcoded `/tmp/cascade_out` paths. While the pipeline is mocked so no actual files are written, using pytest's `tmp_path` fixture would be more consistent with the other tests in this file and avoid static analysis warnings.

Example fix
```diff
-def test_run_cmd_dry_run():
+def test_run_cmd_dry_run(tmp_path):
     runner = CliRunner()
     result = runner.invoke(
         qtlcascade_group,
         [
             "run",
             "--eqtl-ht",
             "/data/eqtl.ht",
             "--pqtl-ht",
             "/data/pqtl.ht",
             "-o",
-            "/tmp/cascade_out",
+            str(tmp_path / "cascade_out"),
             "--dry-run",
         ],
     )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/tests/test_cli_qtlcascade.py` around lines 76 - 116, Update the two tests test_run_cmd_dry_run and test_run_cmd_single_tissue to use the pytest tmp_path fixture instead of the hardcoded "/tmp/cascade_out": add tmp_path as a test parameter, build an output path from tmp_path (e.g. tmp_path / "cascade_out" converted to a string) and pass that string to the CLI args in place of "/tmp/cascade_out"; ensure both test functions (and the runner.invoke calls) use the new tmp_path-derived path so the tests use an isolated temporary directory.

hvantk/qtlcascade/coloc.py (2)
80-85: `_logdiff` can return `-inf` or `NaN` when `a <= b`.

The function assumes `a > b` per the docstring, but if `sum_abf1 + sum_abf2 <= sum_abf_both` due to numerical precision issues, `np.log1p(-np.exp(b - a))` will return `-inf` (when equal) or `NaN` (when `b > a`).

Consider adding a guard similar to the R coloc implementation:
Add numerical guard
```diff
 def _logdiff(a: float, b: float) -> float:
     """Compute ``log(exp(a) - exp(b))`` in log-space (requires ``a > b``).

     Matches ``logdiff`` in the R coloc package (``R/claudia.R``).
     """
+    if b >= a:
+        return float("-inf")
     return a + np.log1p(-np.exp(b - a))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/coloc.py` around lines 80 - 85, The _logdiff function can produce -inf/NaN when a <= b; add a numerical guard at the start of _logdiff to handle that case (e.g., if a <= b + eps) and return -np.inf (log(0)) instead of computing np.log1p(-np.exp(b - a)); otherwise proceed with the existing computation. Use a small epsilon (e.g., 1e-12) to tolerate tiny rounding errors and reference the function name _logdiff so you update that exact function.
219-232: Warning for multi-tissue join mismatch is helpful but may not prevent data corruption.

When `tissue` is not provided but the allpairs tables contain a tissue field, the code now emits a warning (lines 228-232). However, the join still proceeds on `(locus, alleles, gene_id)` only, which can incorrectly merge rows from different tissues into one coloc region.

The previous review suggested either requiring `tissue` when the field exists, or including tissue in the join key. The current approach relies on users heeding the warning, which may be insufficient for data integrity.

Consider enforcing the tissue parameter when a tissue field exists
```diff
     elif eqtl_has_tissue or pqtl_has_tissue:
-        logger.warning(
-            "No tissue filter provided but allpairs table(s) contain a "
-            "'tissue' field. Inner join on (locus, alleles, gene_id) may "
-            "cross-multiply rows from different tissues."
-        )
+        raise ValueError(
+            "No tissue filter provided but allpairs table(s) contain a "
+            "'tissue' field. Provide --tissue to avoid cross-tissue joins."
+        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/coloc.py` around lines 219 - 232, The code currently only logs a warning when eqtl_ht or pqtl_ht contain a "tissue" column but no tissue parameter is provided, which allows incorrect cross-tissue joins; change this to require an explicit tissue: when eqtl_has_tissue or pqtl_has_tissue and tissue is falsy, raise a ValueError (or use logger.error then raise) with a clear message instructing the caller to provide the tissue parameter; keep the existing filtering logic that applies eqtl_ht.filter(eqtl_ht.tissue == tissue) and pqtl_ht.filter(pqtl_ht.tissue == tissue) when tissue is provided (referencing eqtl_ht, pqtl_ht, and tissue), and remove or replace the logger.warning block so the function fails fast instead of proceeding with a risky join.

hvantk/tables/table_builders.py (1)
1534-1569: Parquet import requires a Spark session to exist; no explicit initialization.

`_import_eqtl_gtex_parquet` calls `SparkSession.builder.getOrCreate()`, which assumes a Spark context is already configured. If called before Hail initialization or outside a Spark environment, this may fail silently or with a confusing error.

Consider documenting this requirement or adding a check/initialization step.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hvantk/ptm/plot.py`:
- Around line 120-139: Update the two zip calls in hvantk/ptm/plot.py to use
strict matching: replace zip(bars_p, path_counts) with zip(bars_p, path_counts,
strict=True) and zip(bars_b, benign_counts) with zip(bars_b, benign_counts,
strict=True) so the loops that call ax1.text(...) on bars_p/path_counts and
bars_b/benign_counts explicitly enforce equal lengths and satisfy Ruff B905.
In `@hvantk/qtlcascade/pipeline.py`:
- Around line 464-467: The report currently collects all PNGs from the shared
plots directory via self._plots_dir.glob("*.png"), causing stale images from
other runs to appear; change plot discovery in the plot_paths construction (and
the similar block around lines 510-513) to only include images produced by the
current run—e.g., restrict the glob or filter by the current run
identifier/output directory used when saving plots (use self.output_dir or
self.run_id / run_name as a prefix or parent directory) so plot_paths only
contains files belonging to this invocation.
- Around line 201-202: The sanitized tissue slug produced by _sanitize() is
lossy and can collide (e.g., "Brain/Cortex" vs "Brain Cortex"), causing
tissue_dir and plot filenames to overwrite; update the code in run and
run_collection to derive a stable, unique slug from tissue_label by combining
the sanitized name with a short deterministic suffix (e.g., first 8 chars of a
hash or base64 digest of tissue_label) and use that single unique_slug for
per_tissue paths and plot filenames (replace uses of safe_name with unique_slug
for tissue_dir and any plots/<slug>_* generation, including the other
occurrences around the 245-253 and 539-543 regions) so distinct tissues never
collapse to the same output path.
- Around line 459-462: The try/except around hl.read_table(...).to_pandas()
silently swallows failures (setting gene_df to None) — change the except to
catch Exception as e, log a clear error including result.gene_summary_ht_path
and the exception details (e) via the module logger (or raise a new RuntimeError
with that context) and then either re-raise so report generation fails fast or
explicitly mark the report as failed; apply the same change to the other
identical block that reads the gene summary (the block using
hl.read_table(...).to_pandas() and gene_df).
- Around line 329-331: In _stage_gene_summary(), gene_ht.export(tsv_path) does
not honor self.config.overwrite, causing inconsistent rerun behavior; modify the
code around the export so that before calling gene_ht.export(tsv_path) you check
self.config.overwrite and if true remove any existing tsv_path (use the
appropriate filesystem removal for your execution environment: local os.remove
or cloud/hadoop rm via hail/utils or subprocess), and if overwrite is false and
tsv_path already exists raise/exit with a clear error; do this adjacent to the
existing call sites _stage_gene_summary, build_cascade_gene_summary, and the
tissue_dir/gene_summary.tsv path to ensure consistent behavior.
In `@hvantk/tables/table_builders.py`:
- Around line 1507-1526: _scan_tissue_files currently uses pathlib.Path and will
fail for cloud URIs (gs://, s3://); replace the Path-based logic with a
cloud-aware filesystem like hailtop.fs (or hailtop.fs.get_fs) similar to
_cleanup_temp_file: detect cloud URIs (e.g., input_path.startswith("gs://") or
"s3://"), use the filesystem client to check existence and file/dir semantics,
list matching files using the fs.glob or fs.ls APIs for each extension, and
build the [(file_path, tissue_name)] list by extracting the filename stem (split
on "/" then split basename on "." to get the tissue prefix) so both local and
cloud paths are supported by _scan_tissue_files.
---
Duplicate comments:
In `@hvantk/qtlcascade/pipeline.py`:
- Around line 76-91: The validate() method must reject out-of-range thresholds
and coloc parameters: add checks in validate() to ensure eqtl_p_threshold and
pqtl_p_threshold are between 0 and 1 (exclusive of negatives or >1) and, if
present, that coloc priors (e.g., coloc_prior / coloc_prior_secondary) are >0
and <=1 and coloc window/size parameters are positive integers; append
descriptive error strings to errors when any of these constraints fail, keeping
the existing eqtl_allpairs_ht/pqtl_allpairs_ht consistency check (use the
validate() function and attribute names eqtl_p_threshold, pqtl_p_threshold,
coloc_prior, coloc_window to locate where to add the new guards).
---
Nitpick comments:
In `@hvantk/qtlcascade/coloc.py`:
- Around line 80-85: The _logdiff function can produce -inf/NaN when a <= b; add
a numerical guard at the start of _logdiff to handle that case (e.g., if a <= b
+ eps) and return -np.inf (log(0)) instead of computing np.log1p(-np.exp(b -
a)); otherwise proceed with the existing computation. Use a small epsilon (e.g.,
1e-12) to tolerate tiny rounding errors and reference the function name _logdiff
so you update that exact function.
- Around line 219-232: The code currently only logs a warning when eqtl_ht or
pqtl_ht contain a "tissue" column but no tissue parameter is provided, which
allows incorrect cross-tissue joins; change this to require an explicit tissue:
when eqtl_has_tissue or pqtl_has_tissue and tissue is falsy, raise a ValueError
(or use logger.error then raise) with a clear message instructing the caller to
provide the tissue parameter; keep the existing filtering logic that applies
eqtl_ht.filter(eqtl_ht.tissue == tissue) and pqtl_ht.filter(pqtl_ht.tissue ==
tissue) when tissue is provided (referencing eqtl_ht, pqtl_ht, and tissue), and
remove or replace the logger.warning block so the function fails fast instead of
proceeding with a risky join.
In `@hvantk/qtlcascade/plot.py`:
- Around line 338-340: The loop using zip(bp["boxes"], colors) should use
zip(..., strict=True) to defensively ensure bp["boxes"] and colors have
identical lengths; update the iteration (where bp["boxes"] and colors are
created from classes) to call zip(bp["boxes"], colors, strict=True) so any
future mismatch between bp["boxes"] and colors raises an error immediately.
In `@hvantk/tests/test_cli_qtlcascade.py`:
- Around line 76-116: Update the two tests test_run_cmd_dry_run and
test_run_cmd_single_tissue to use the pytest tmp_path fixture instead of the
hardcoded "/tmp/cascade_out": add tmp_path as a test parameter, build an output
path from tmp_path (e.g. tmp_path / "cascade_out" converted to a string) and
pass that string to the CLI args in place of "/tmp/cascade_out"; ensure both
test functions (and the runner.invoke calls) use the new tmp_path-derived path
so the tests use an isolated temporary directory.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5922f759-f2df-4dcc-b255-37aec4fefa9a
📒 Files selected for processing (15)
- docs_site/examples/index.md
- docs_site/tools/qtlcascade.md
- hvantk/commands/make_table_cli.py
- hvantk/commands/qtlcascade_cli.py
- hvantk/ptm/plot.py
- hvantk/qtlcascade/cascade.py
- hvantk/qtlcascade/coloc.py
- hvantk/qtlcascade/constants.py
- hvantk/qtlcascade/gene_summary.py
- hvantk/qtlcascade/pipeline.py
- hvantk/qtlcascade/plot.py
- hvantk/qtlcascade/report.py
- hvantk/tables/table_builders.py
- hvantk/tests/test_cli_mktable_qtl.py
- hvantk/tests/test_cli_qtlcascade.py
✅ Files skipped from review due to trivial changes (2)
- docs_site/examples/index.md
- hvantk/qtlcascade/constants.py
🚧 Files skipped from review as they are similar to previous changes (3)
- hvantk/qtlcascade/gene_summary.py
- hvantk/commands/make_table_cli.py
- hvantk/qtlcascade/cascade.py
```diff
 for bar, n in zip(bars_p, path_counts):
     if n > 0:
-        ax1.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
-                 str(n), ha="center", va="bottom", fontsize=8)
+        ax1.text(
+            bar.get_x() + bar.get_width() / 2,
+            bar.get_height(),
+            str(n),
+            ha="center",
+            va="bottom",
+            fontsize=8,
+        )
 for bar, n in zip(bars_b, benign_counts):
     if n > 0:
-        ax1.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
-                 str(n), ha="center", va="bottom", fontsize=8)
+        ax1.text(
+            bar.get_x() + bar.get_width() / 2,
+            bar.get_height(),
+            str(n),
+            ha="center",
+            va="bottom",
+            fontsize=8,
+        )
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify the repo's Python baseline supports zip(strict=...) and matches Ruff target.
for f in pyproject.toml setup.cfg setup.py tox.ini ruff.toml .ruff.toml; do
if [ -f "$f" ]; then
echo "== $f =="
rg -n 'requires-python|python_requires|target-version|\[tool\.ruff\]|\[project\]|\[metadata\]' "$f" || true
echo
fi
done

Repository: bigbio/hvantk
Length of output: 95
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Read the full content of pyproject.toml to find Python version requirements
echo "=== Content of pyproject.toml ==="
if [ -f "pyproject.toml" ]; then
cat pyproject.toml
else
echo "pyproject.toml not found"
fi
echo ""
echo "=== Content of ruff.toml ==="
if [ -f "ruff.toml" ]; then
cat ruff.toml
else
echo "ruff.toml not found"
fi
echo ""
echo "=== Checking for Python version in any config files ==="
rg -i "python" --type toml --type cfg --type txt . || echo "No matches found"

Repository: bigbio/hvantk
Length of output: 2839
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify the actual code at lines 120-139 in hvantk/ptm/plot.py
echo "=== Code at lines 120-139 of hvantk/ptm/plot.py ==="
sed -n '115,145p' hvantk/ptm/plot.py | cat -n
echo ""
echo "=== Check if strict= parameter is used anywhere in the codebase ==="
rg "zip\(" hvantk/ -A 1 -B 1

Repository: bigbio/hvantk
Length of output: 7176
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Get full context of the hvantk/ptm/plot.py file to identify all zip() calls
wc -l hvantk/ptm/plot.py
echo "---"
# Extract lines with zip() to show context
rg -n "zip\(" hvantk/ptm/plot.py -B 2 -A 2

Repository: bigbio/hvantk
Length of output: 1485
Add explicit strict= to both zip() loops to align with B905 lint rule.
Lines 120 and 130 trigger Ruff B905. Since the project targets Python 3.10+, strict=True can be used to make the length-matching contract explicit and catch any misalignment early.
🔧 Proposed fix
```diff
-    for bar, n in zip(bars_p, path_counts):
+    for bar, n in zip(bars_p, path_counts, strict=True):
         if n > 0:
             ax1.text(
                 bar.get_x() + bar.get_width() / 2,
                 bar.get_height(),
                 str(n),
                 ha="center",
                 va="bottom",
                 fontsize=8,
             )
-    for bar, n in zip(bars_b, benign_counts):
+    for bar, n in zip(bars_b, benign_counts, strict=True):
         if n > 0:
             ax1.text(
                 bar.get_x() + bar.get_width() / 2,
                 bar.get_height(),
                 str(n),
                 ha="center",
                 va="bottom",
                 fontsize=8,
             )
```
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 120-120: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
[warning] 130-130: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/ptm/plot.py` around lines 120 - 139, Update the two zip calls in
hvantk/ptm/plot.py to use strict matching: replace zip(bars_p, path_counts) with
zip(bars_p, path_counts, strict=True) and zip(bars_b, benign_counts) with
zip(bars_b, benign_counts, strict=True) so the loops that call ax1.text(...) on
bars_p/path_counts and bars_b/benign_counts explicitly enforce equal lengths and
satisfy Ruff B905.
```python
safe_name = _sanitize(tissue_label)
tissue_dir = str(self._per_tissue_dir / safe_name)
```
Sanitized tissue names can overwrite each other's outputs.
_sanitize() is lossy, and run() uses that slug directly for both per_tissue/<slug> and plots/<slug>_*. Distinct tissues like Brain/Cortex and Brain Cortex collapse to the same slug, so the later run overwrites the earlier artifacts and run_collection() can read the same path under multiple tissue labels.
Also applies to: 245-253, 539-543
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/qtlcascade/pipeline.py` around lines 201 - 202, The sanitized tissue
slug produced by _sanitize() is lossy and can collide (e.g., "Brain/Cortex" vs
"Brain Cortex"), causing tissue_dir and plot filenames to overwrite; update the
code in run and run_collection to derive a stable, unique slug from tissue_label
by combining the sanitized name with a short deterministic suffix (e.g., first 8
chars of a hash or base64 digest of tissue_label) and use that single
unique_slug for per_tissue paths and plot filenames (replace uses of safe_name
with unique_slug for tissue_dir and any plots/<slug>_* generation, including the
other occurrences around the 245-253 and 539-543 regions) so distinct tissues
never collapse to the same output path.
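The hash-suffix strategy from the prompt can be sketched as follows (`unique_slug` is an illustrative helper name; any short deterministic digest works):

```python
import hashlib
import re

def unique_slug(tissue_label: str) -> str:
    """Sanitized name plus a short digest so distinct labels never collide."""
    safe = re.sub(r"[^A-Za-z0-9]+", "_", tissue_label).strip("_") or "tissue"
    digest = hashlib.sha1(tissue_label.encode("utf-8")).hexdigest()[:8]
    return f"{safe}_{digest}"

a = unique_slug("Brain/Cortex")
b = unique_slug("Brain Cortex")
print(a, b)
assert a != b                            # lossy sanitization no longer collides
assert a == unique_slug("Brain/Cortex")  # deterministic across runs
```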
```python
# Export TSV for downstream use
tsv_path = f"{tissue_dir}/gene_summary.tsv"
gene_ht.export(tsv_path)
```
🧩 Analysis chain
🏁 Script executed:
cd /tmp && git ls-files | grep -E "pipeline\.py$"

Repository: bigbio/hvantk
Length of output: 125
🏁 Script executed:
cat -n hvantk/qtlcascade/pipeline.py | sed -n '320,340p'

Repository: bigbio/hvantk
Length of output: 1078
🏁 Script executed:
rg -n "\.export\(|\.write\(|\.checkpoint\(" hvantk/qtlcascade/pipeline.py -A 2

Repository: bigbio/hvantk
Length of output: 182
🌐 Web query:
Hail Table.export() API overwrite parameter documentation
💡 Result:
The Hail Table.export method does not have an overwrite parameter. It exports the table to a text file (typically TSV, recommended with .bgz extension for compression). Unlike Table.write, which has an overwrite parameter to overwrite existing files at the destination, Table.export lacks this option in the official documentation and source code. Usage example: table1.export('output/table1.tsv.bgz') Note: It is highly recommended to export large files with a .bgz extension, which uses block gzipped compression. These files can be read natively with Hail methods and tools like Python's gzip.open or R's read.table. Nested structures export as JSON; use Table.flatten first for separate fields.
Citations:
- 1: https://hail.is/docs/0.2/methods/impex.html
- 2: https://hail.is/docs/latest/hail.Table.html
- 3: https://hail.is/docs/0.2/_modules/hail/table.html
- 4: https://hail.is/docs/0.2/hail.Table.html
🌐 Web query:
Hail Table.write() overwrite parameter documentation
💡 Result:
The overwrite parameter in Hail's Table.write method is a boolean option that determines whether to overwrite an existing file at the specified output path. Signature: table.write(output, stage_locally=False, overwrite=False) - overwrite (bool): If True, overwrite an existing file at the destination. Defaults to False. Example: table1.write('output/table1.ht', overwrite=True) Warning: Do not write or checkpoint to a path that is already an input source for the query, as this can cause data loss. This is from the official Hail documentation for hail.Table.write.
Citations:
🌐 Web query:
Hail Table.checkpoint() overwrite parameter documentation
💡 Result:
The Table.checkpoint method in Hail checkpoints the table to disk by writing it to the specified path and then reading it back. It has an overwrite parameter of type bool. If overwrite=True, it will overwrite an existing file at the destination path. The default value is not explicitly stated in the documentation, but usage examples (e.g., write(..., overwrite=True)) suggest it may default to False. Full parameters: - output (str): Path at which to write. - stage_locally (bool): If True, major output will be written to temporary local storage before being copied to output. - overwrite (bool): If True, overwrite an existing file at the destination. Warning: Do not checkpoint to a path that is already an input source for the query, as this can cause data loss. Returns: Table Example usage: table = table.checkpoint('output/table.ht', overwrite=True)
Citations:
- 1: https://hail.is/docs/latest/hail.Table.html
- 2: https://hail.is/docs/0.2/_modules/hail/table.html
- 3: https://hail.is/docs/0.2/hail.Table.html
🏁 Script executed:
cat -n hvantk/qtlcascade/pipeline.py | sed -n '300,335p'

Repository: bigbio/hvantk
Length of output: 1678
🏁 Script executed:
rg -n "def _stage" hvantk/qtlcascade/pipeline.py | head -20

Repository: bigbio/hvantk
Length of output: 284
🏁 Script executed:
cat -n hvantk/qtlcascade/pipeline.py | sed -n '285,335p'

Repository: bigbio/hvantk
Length of output: 2330
gene_ht.export() lacks overwrite handling unlike other writes in this stage.
In _stage_gene_summary(), the call to build_cascade_gene_summary() (line 317) receives overwrite=self.config.overwrite, but gene_ht.export(tsv_path) (line 331) cannot use the same pattern—Hail's Table.export() API does not expose an overwrite parameter, unlike Table.write() and Table.checkpoint() which do. This creates inconsistent behavior on reruns to an existing output directory. Consider an explicit delete or versioned-path strategy to ensure reruns behave as intended.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/qtlcascade/pipeline.py` around lines 329 - 331, in
_stage_gene_summary(), gene_ht.export(tsv_path) does not honor
self.config.overwrite, causing inconsistent rerun behavior; modify the code
around the export so that before calling gene_ht.export(tsv_path) you check
self.config.overwrite and if true remove any existing tsv_path (use the
appropriate filesystem removal for your execution environment: local os.remove
or cloud/hadoop rm via hail/utils or subprocess), and if overwrite is false and
tsv_path already exists raise/exit with a clear error; do this adjacent to the
existing call sites _stage_gene_summary, build_cascade_gene_summary, and the
tissue_dir/gene_summary.tsv path to ensure consistent behavior.
```python
plot_paths = {}
for p in self._plots_dir.glob("*.png"):
    plot_paths[p.stem] = str(p)
```
|
Scope report plots to the current run.
Both report builders glob every PNG in the shared plots directory. Reusing output_dir pulls stale images from earlier tissues/runs into the new HTML report, so the report can show plots that were not generated for the current invocation.
Also applies to: 510-513
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/qtlcascade/pipeline.py` around lines 464 - 467, The report currently
collects all PNGs from the shared plots directory via
self._plots_dir.glob("*.png"), causing stale images from other runs to appear;
change plot discovery in the plot_paths construction (and the similar block
around lines 510-513) to only include images produced by the current run—e.g.,
restrict the glob or filter by the current run identifier/output directory used
when saving plots (use self.output_dir or self.run_id / run_name as a prefix or
parent directory) so plot_paths only contains files belonging to this
invocation.
```python
def _scan_tissue_files(input_path, extensions):
    """Return ``[(file_path, tissue_name), ...]`` from *input_path*.

    Tissue name is inferred from the filename prefix before the first dot
    (e.g. ``Brain_Cortex.v8.signif_variant_gene_pairs.txt.gz`` → ``Brain_Cortex``).
    Accepts a single file or a directory.
    """
    from pathlib import Path

    p = Path(input_path)
    if p.is_file():
        return [(str(p), p.stem.split(".")[0])]
    if not p.is_dir():
        raise FileNotFoundError(f"Not a file or directory: {input_path}")
    matches = []
    for ext in extensions:
        matches.extend(sorted(p.glob(f"*{ext}")))
    if not matches:
        raise FileNotFoundError(f"No files matching {extensions} in {input_path}")
    return [(str(f), f.stem.split(".")[0]) for f in matches]
```
_scan_tissue_files won't work with cloud URIs (gs://, s3://).
This helper uses pathlib.Path which only handles local filesystem paths. The eQTL/pQTL builders accept input_path that could be cloud URIs, but Path.is_file(), Path.is_dir(), and Path.glob() will fail or return incorrect results for such paths.
Consider using Hail's hailtop.fs or similar cloud-aware filesystem utilities, similar to how _cleanup_temp_file handles this at lines 117-144.
🛠️ Suggested approach

```diff
 def _scan_tissue_files(input_path, extensions):
-    from pathlib import Path
+    try:
+        import hailtop.fs as hfs
+    except ImportError:
+        hfs = None

-    p = Path(input_path)
-    if p.is_file():
-        return [(str(p), p.stem.split(".")[0])]
-    if not p.is_dir():
-        raise FileNotFoundError(f"Not a file or directory: {input_path}")
+    # Check if cloud URI
+    if input_path.startswith(("gs://", "s3://", "hdfs://")):
+        if hfs is None:
+            raise ImportError("hailtop.fs required for cloud URIs")
+        # Use hailtop.fs for cloud paths
+        if hfs.is_file(input_path):
+            name = input_path.rsplit("/", 1)[-1].split(".")[0]
+            return [(input_path, name)]
+        # ... cloud directory listing logic
+    else:
+        from pathlib import Path
+        p = Path(input_path)
+        # existing local logic...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/tables/table_builders.py` around lines 1507 - 1526, _scan_tissue_files
currently uses pathlib.Path and will fail for cloud URIs (gs://, s3://); replace
the Path-based logic with a cloud-aware filesystem like hailtop.fs (or
hailtop.fs.get_fs) similar to _cleanup_temp_file: detect cloud URIs (e.g.,
input_path.startswith("gs://") or "s3://"), use the filesystem client to check
existence and file/dir semantics, list matching files using the fs.glob or fs.ls
APIs for each extension, and build the [(file_path, tissue_name)] list by
extracting the filename stem (split on "/" then split basename on "." to get the
tissue prefix) so both local and cloud paths are supported by
_scan_tissue_files.
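To make the suggested dispatch concrete, here is a minimal self-contained sketch. It is hedged: the cloud branch assumes `hailtop.fs` exposes `is_file` and `ls` returning entries with a `.path` attribute, which should be verified against the installed Hail version, and the function name `scan_tissue_files` is illustrative rather than the project's actual helper.

```python
import re
from pathlib import Path


def scan_tissue_files(input_path, extensions):
    """Return [(file_path, tissue_name), ...] for local paths or cloud URIs."""
    if re.match(r"^(gs|s3|hdfs)://", input_path):
        # Cloud path: defer to Hail's cloud-aware filesystem (assumed API).
        import hailtop.fs as hfs

        if hfs.is_file(input_path):
            name = input_path.rsplit("/", 1)[-1].split(".")[0]
            return [(input_path, name)]
        entries = [e.path for e in hfs.ls(input_path)]
        matches = sorted(e for e in entries if e.endswith(tuple(extensions)))
        if not matches:
            raise FileNotFoundError(f"No files matching {extensions} in {input_path}")
        return [(e, e.rsplit("/", 1)[-1].split(".")[0]) for e in matches]

    # Local path: keep the existing pathlib logic unchanged.
    p = Path(input_path)
    if p.is_file():
        return [(str(p), p.stem.split(".")[0])]
    if not p.is_dir():
        raise FileNotFoundError(f"Not a file or directory: {input_path}")
    matches = []
    for ext in extensions:
        matches.extend(sorted(p.glob(f"*{ext}")))
    if not matches:
        raise FileNotFoundError(f"No files matching {extensions} in {input_path}")
    return [(str(f), f.stem.split(".")[0]) for f in matches]
```

Because the scheme check happens up front, local callers never import `hailtop`, so the local code path stays usable even when Hail is not installed.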
♻️ Duplicate comments (3)
hvantk/qtlcascade/pipeline.py (3)
468-472: ⚠️ Potential issue | 🟠 Major — Reports can still pick up stale plots from earlier runs.
Both paths rebuild `plot_paths` by globbing a shared directory instead of using the files generated by this invocation. Reusing `output_dir` can leak old `<slug>_*.png` files or a stale `cross_tissue_heatmap.png` into the new HTML. Tracking successful plot writes on `CascadeResult` would keep the report scoped to the current run.
Also applies to: 519-522
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/pipeline.py` around lines 468 - 472, The current code globbing self._plots_dir (using _sanitize(result.tissue) and building plot_paths from files matching "{safe}_*.png") can include stale images from previous runs; instead, record the exact plot files written during this invocation on the CascadeResult (e.g., add/append to a result.plots list or result.plot_paths when each plot is written) and use that list to build plot_paths; similarly, stop globbing for "cross_tissue_heatmap.png" and read it only if it was recorded on the CascadeResult for this run. Update the plot-writing functions to append successful writes to the CascadeResult (unique symbols: CascadeResult, result.plots or result.plot_paths, self._plots_dir, _sanitize, cross_tissue_heatmap.png) and replace the glob-based construction with iteration over the recorded list.
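One way to implement that tracking, as a sketch: `CascadeResult` here is a stripped-down stand-in for the real dataclass, and `save_plot`/`report_plots` are hypothetical helpers, not the pipeline's actual API.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class CascadeResult:
    """Stand-in for the pipeline's result object, extended with a field
    that records every plot written during this invocation."""
    tissue: str
    plot_paths: list = field(default_factory=list)


def save_plot(png_bytes: bytes, path: Path, result: CascadeResult) -> None:
    """Write one plot and record it on the result, so the report can
    iterate result.plot_paths instead of globbing a shared directory."""
    path.write_bytes(png_bytes)
    result.plot_paths.append(path)


def report_plots(result: CascadeResult) -> list:
    """Only files written by this run reach the report; stale PNGs left
    in the same output_dir by earlier runs are never picked up."""
    return [p for p in result.plot_paths if p.exists()]
```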
201-202: ⚠️ Potential issue | 🟠 Major — Make the tissue slug collision-resistant.
`_sanitize()` is lossy, so distinct labels like `Brain/Cortex` and `Brain Cortex` still collapse to the same slug. Since that value drives `per_tissue` directories and plot prefixes, one tissue can overwrite another's artifacts. Generate a unique slug once per run and reuse it everywhere.
Also applies to: 549-553
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/pipeline.py` around lines 201 - 202, The current use of _sanitize(tissue_label) is lossy and can collapse distinct labels (e.g. "Brain/Cortex" vs "Brain Cortex"); replace ad-hoc calls with a single-run mapping that generates a collision-resistant slug per tissue_label (e.g. sanitized base plus a short deterministic hash or UUID derived from the original tissue_label) and reuse that slug wherever safe_name or tissue_dir and plot prefixes are produced (locations using _sanitize, including the code that sets safe_name and tissue_dir and the later usages around the plot prefix code). Implement a map on the pipeline instance (e.g. self._tissue_slug_map) populated once when tissues are enumerated and then reference self._tissue_slug_map[tissue_label] instead of calling _sanitize directly. Ensure the slug remains stable for the duration of the run and is used for per-tissue directories and plot prefixes.
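A collision-resistant slug along the lines the prompt describes could look like this. This is a sketch: `tissue_slug` is a hypothetical name, and the sanitization regex is illustrative rather than the project's `_sanitize`.

```python
import hashlib
import re


def tissue_slug(label: str) -> str:
    """Sanitized base plus a short deterministic hash of the original
    label, so two distinct labels can never collapse to one slug."""
    base = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_") or "tissue"
    digest = hashlib.sha1(label.encode("utf-8")).hexdigest()[:8]
    return f"{base}_{digest}"
```

Because the hash is derived from the original label, the slug is stable across runs, which keeps per-tissue directory names and plot prefixes reproducible while still being unique per label.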
329-331: ⚠️ Potential issue | 🟠 Major — `gene_summary.tsv` still ignores `overwrite`.
The HT checkpoint honors `self.config.overwrite`, but `gene_ht.export(tsv_path)` does not. A rerun into the same tissue directory can therefore succeed on `gene_summary.ht` and then fail when `gene_summary.tsv` already exists. Handle the destination explicitly before calling `export()`.
Does Hail `Table.export()` support an `overwrite` parameter, and what is the documented behavior if the destination path already exists?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/qtlcascade/pipeline.py` around lines 329 - 331, The export step ignores self.config.overwrite: before calling gene_ht.export(tsv_path) check whether the destination exists and honor self.config.overwrite — if overwrite is true, remove the existing tsv (and any related compressed variant) using the appropriate filesystem helper (Hail's hadoop utilities for remote paths or os.unlink for local paths), and if overwrite is false and the file exists, raise or fail early with a clear message; then call gene_ht.export(tsv_path). Reference gene_ht, export(), tsv_path, tissue_dir and self.config.overwrite to locate where to add the existence check and deletion.
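A local-filesystem sketch of handling the destination before `export()`. Hedged: for `gs://`-style paths the existence check and deletion would need Hail's hadoop helpers instead of `os`, and `prepare_export_path` is a hypothetical helper name.

```python
import os


def prepare_export_path(tsv_path: str, overwrite: bool) -> str:
    """Mirror the overwrite semantics of the .ht checkpoint for the TSV:
    fail early when the file exists and overwrite is False, otherwise
    clear the destination so Table.export() cannot trip over it."""
    if os.path.exists(tsv_path):
        if not overwrite:
            raise FileExistsError(
                f"{tsv_path} already exists; rerun with overwrite enabled"
            )
        os.unlink(tsv_path)
    return tsv_path
```

The pipeline would call this with `self.config.overwrite` immediately before `gene_ht.export(tsv_path)`.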
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@hvantk/qtlcascade/pipeline.py`:
- Around line 468-472: The current code globbing self._plots_dir (using
_sanitize(result.tissue) and building plot_paths from files matching
"{safe}_*.png") can include stale images from previous runs; instead, record the
exact plot files written during this invocation on the CascadeResult (e.g.,
add/append to a result.plots list or result.plot_paths when each plot is
written) and use that list to build plot_paths; similarly, stop globbing for
"cross_tissue_heatmap.png" and read it only if it was recorded on the
CascadeResult for this run. Update the plot-writing functions to append
successful writes to the CascadeResult (unique symbols: CascadeResult,
result.plots or result.plot_paths, self._plots_dir, _sanitize,
cross_tissue_heatmap.png) and replace the glob-based construction with iteration
over the recorded list.
- Around line 201-202: The current use of _sanitize(tissue_label) is lossy and
can collapse distinct labels (e.g. "Brain/Cortex" vs "Brain Cortex"); replace
ad-hoc calls with a single-run mapping that generates a collision-resistant slug
per tissue_label (e.g. sanitized base plus a short deterministic hash or UUID
derived from the original tissue_label) and reuse that slug wherever safe_name
or tissue_dir and plot prefixes are produced (locations using _sanitize,
including the code that sets safe_name and tissue_dir and the later usages
around the plot prefix code). Implement a map on the pipeline instance (e.g.
self._tissue_slug_map) populated once when tissues are enumerated and then
reference self._tissue_slug_map[tissue_label] instead of calling _sanitize
directly. Ensure the slug remains stable for the duration of the run and is used
for per-tissue directories and plot prefixes.
- Around line 329-331: The export step ignores self.config.overwrite: before
calling gene_ht.export(tsv_path) check whether the destination exists and honor
self.config.overwrite — if overwrite is true, remove the existing tsv (and any
related compressed variant) using the appropriate filesystem helper (Hail's
hadoop utilities for remote paths or os.unlink for local paths), and if
overwrite is false and the file exists, raise or fail early with a clear
message; then call gene_ht.export(tsv_path). Reference gene_ht, export(),
tsv_path, tissue_dir and self.config.overwrite to locate where to add the
existence check and deletion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fe20b8a5-776a-43cc-9ad6-8059eb5a192a
📒 Files selected for processing (1)
hvantk/qtlcascade/pipeline.py
Hey @copilot |
…d fix test module mocking Agent-Logs-Url: https://github.com/bigbio/hvantk/sessions/7c166da9-3a7e-46e0-a9de-77c6e59835b8 Co-authored-by: enriquea <8707954+enriquea@users.noreply.github.com>
Fixed in commit
Warning: Firewall rules blocked me from connecting to one or more addresses. I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
- `DEFAULT_COLOC_H4_THRESHOLD` in `hvantk/qtlcascade/coloc.py`
- `CLINGEN_HEADER_SKIP_LINES` in `hvantk/tables/table_builders.py`
- `tempfile` and `os` in `hvantk/tables/table_builders.py`
- `test_cascade_cmd` and `test_coloc_cmd` in `test_cli_qtlcascade.py` to use `patch.dict(sys.modules, ...)` for mocking `hail` and `hvantk.core.hail_context` (avoids import failure when `hail` is not installed)
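The `patch.dict(sys.modules, ...)` pattern used for the test fix can be sketched as follows. This is a minimal illustration, not the project's actual tests: injecting a mock into `sys.modules` makes `import hail` succeed even when Hail is absent, and `patch.dict` restores the original state on exit.

```python
import sys
from unittest.mock import MagicMock, patch


def run_with_fake_hail():
    """Inject mock modules so code that does `import hail` works even
    when Hail is not installed; the originals are restored on exit."""
    fake_hail = MagicMock(name="hail")
    with patch.dict(sys.modules, {"hail": fake_hail,
                                  "hvantk.core.hail_context": MagicMock()}):
        import hail  # resolves to the mock already present in sys.modules
        assert hail is fake_hail
    # outside the context manager, sys.modules is back to its prior state
    return True
```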